[[tags: manual]]
[[toc:]]

== Module (chicken irregex)

This module provides support for regular expressions, using the
powerful ''irregex'' regular expression engine by Alex Shinn. It
supports POSIX syntax with various (irregular) PCRE extensions,
as well as SCSH's SRE syntax, with various aliases for commonly used
patterns. DFA matching is used when possible, otherwise a
closure-compiled NFA approach is used. Matching may be performed over
standard Scheme strings, or over arbitrarily chunked streams of
strings.

On systems that support dynamic loading, the {{irregex}} module can be
made available in the CHICKEN interpreter ({{csi}}) by entering

<enscript highlight=scheme>
(import (chicken irregex))
</enscript>

=== Procedures

==== irregex
==== string->irregex
==== sre->irregex

<procedure>(irregex <posix-string-or-sre> [<options> ...])</procedure><br>
<procedure>(string->irregex <posix-string> [<options> ...])</procedure><br>
<procedure>(sre->irregex <sre> [<options> ...])</procedure><br>

Compiles a regular expression from either a POSIX-style regular
expression string (with most PCRE extensions) or an SCSH-style SRE.
There is no {{(rx ...)}} syntax - just use normal Scheme lists, with
{{quasiquote}} if you like.

Technically a string by itself could be considered a valid (though
rather silly) SRE, so if you want to just match a literal string you
should use something like {{(irregex `(: ,str))}}, or use the explicit
{{(sre->irregex str)}}.

The options are a list of any of the following symbols:

; {{'i}}, {{'case-insensitive}} : match case-insensitively
; {{'m}}, {{'multi-line}} : treat string as multiple lines (affects {{^}} and {{$}})
; {{'s}}, {{'single-line}} : treat string as a single line ({{.}} can match newline)
; {{'utf8}} : utf8-mode (assumes strings are byte-strings)
; {{'fast}} : try to optimize the regular expression
; {{'small}} : try to compile a smaller regular expression
; {{'backtrack}} : enforce a backtracking implementation

The {{'fast}} and {{'small}} options are heuristic guidelines and will
not necessarily make the compiled expression faster or smaller.

==== string->sre
==== maybe-string->sre

<procedure>(string->sre <str>)</procedure><br>
<procedure>(maybe-string->sre <obj>)</procedure><br>

{{string->sre}} converts a POSIX string into an SRE. It is provided
for backwards compatibility.

{{maybe-string->sre}} does the same thing, but only if the argument is
a string, otherwise it assumes {{<obj>}} is an SRE and returns it
as-is. This is useful when you want to provide an API that allows
either a POSIX string or SRE (like {{irregex}} or {{irregex-search}}
below) - it ensures the result is an SRE.

==== glob->sre

<procedure>(glob->sre <str>)</procedure>

Converts a basic shell-style glob to an SRE which matches only strings
which the glob would match. The glob characters {{[}}, {{]}}, {{*}}
and {{?}} are supported.

==== irregex?

<procedure>(irregex? <obj>)</procedure>

Returns {{#t}} iff the object is a regular expression.

==== irregex-search

<procedure>(irregex-search <irx> <str> [<start> <end>])</procedure>

Searches for any instances of the pattern {{<irx>}} (a POSIX string, SRE
sexp, or pre-compiled regular expression) in {{<str>}}, optionally within
the given range. If a match is found, returns a match object,
otherwise returns {{#f}}.

Match objects can be used to query the original range of the string or
its submatches using the {{irregex-match-*}} procedures below.

Examples:

<enscript highlight=scheme>
(irregex-search "foobar" "abcFOOBARdef") => #f

(irregex-search (irregex "foobar" 'i) "abcFOOBARdef") => #<match>

(irregex-search '(w/nocase "foobar") "abcFOOBARdef") => #<match>
</enscript>

Note, the actual match result is represented by a vector in the
default implementation.
Throughout this manual, we'll just write
{{#<match>}} to show that a successful match was returned when the
details are not important.

Matching follows the POSIX leftmost, longest semantics, when
searching. That is, of all possible matches in the string,
{{irregex-search}} will return the match at the first position
(leftmost). If multiple matches are possible from that same first
position, the longest match is returned.

==== irregex-match
==== irregex-match?

<procedure>(irregex-match <irx> <str> [<start> <end>])</procedure>
<procedure>(irregex-match? <irx> <str> [<start> <end>])</procedure>

Like {{irregex-search}}, but performs an anchored match against the
beginning and end of the substring specified by {{<start>}} and
{{<end>}}, without searching.

Where {{irregex-match}} returns a match object, {{irregex-match?}}
just returns a boolean indicating whether it matched or not.

Examples:

<enscript highlight=scheme>
(irregex-match '(w/nocase "foobar") "abcFOOBARdef") => #f

(irregex-match '(w/nocase "foobar") "FOOBAR") => #<match>
</enscript>

==== irregex-match-data?

<procedure>(irregex-match-data? <obj>)</procedure>

Returns {{#t}} iff the object is a successful match result from
{{irregex-search}} or {{irregex-match}}.

==== irregex-num-submatches
==== irregex-match-num-submatches

<procedure>(irregex-num-submatches <irx>)</procedure><br>
<procedure>(irregex-match-num-submatches <match>)</procedure>

Returns the number of numbered submatches that are defined in the
irregex or match object.

==== irregex-names
==== irregex-match-names

<procedure>(irregex-names <irx>)</procedure><br>
<procedure>(irregex-match-names <match>)</procedure>

Returns an association list of named submatches that are defined in
the irregex or match object.
The {{car}} of each item in this list is
the name of a submatch, the {{cdr}} of each item is the numerical
submatch corresponding to this name. If a named submatch occurs
multiple times in the irregex, it will also occur multiple times in
this list.

==== irregex-match-valid-index?

<procedure>(irregex-match-valid-index? <match> <index-or-name>)</procedure><br>

Returns {{#t}} iff the {{<index-or-name>}} named submatch or index is
defined in the {{<match>}} object.

==== irregex-match-substring
==== irregex-match-start-index
==== irregex-match-end-index

<procedure>(irregex-match-substring <match> [<index-or-name>])</procedure><br>
<procedure>(irregex-match-start-index <match> [<index-or-name>])</procedure><br>
<procedure>(irregex-match-end-index <match> [<index-or-name>])</procedure>

Fetches the matched substring (or its start or end offset) at the
given submatch index, or named submatch. The entire match is index 0,
the first submatch is index 1, etc. The default is index 0.

Returns {{#f}} if the given submatch did not match the source string (can happen when you have the submatch inside an {{or}} alternative, for example).

==== irregex-match-subchunk
==== irregex-match-start-chunk
==== irregex-match-end-chunk

<procedure>(irregex-match-subchunk <match> [<index-or-name>])</procedure>
<procedure>(irregex-match-start-chunk <match> [<index-or-name>])</procedure>
<procedure>(irregex-match-end-chunk <match> [<index-or-name>])</procedure>

Access the chunks delimiting the submatch index, or named submatch.

{{irregex-match-subchunk}} generates a chunked data-type for the given
match item, of the same type as the underlying chunk type (see Chunked
String Matching below).
This is only available if the chunk type
specifies the get-subchunk API, otherwise an error is raised.

Returns {{#f}} if the given submatch did not match the source string (can happen when you have the submatch inside an {{or}} alternative, for example).

==== irregex-replace
==== irregex-replace/all

<procedure>(irregex-replace <irx> <str> [<replacements> ...])</procedure><br>
<procedure>(irregex-replace/all <irx> <str> [<replacements> ...])</procedure>

Matches a pattern in a string, and replaces it with a (possibly empty)
list of substitutions. Each {{<replacement>}} can be either a string
literal, a numeric index, a symbol (as a named submatch), or a
procedure which takes one argument (the match object) and returns a
string.

Examples:

<enscript highlight=scheme>
(irregex-replace "[aeiou]" "hello world" "*") => "h*llo world"

(irregex-replace/all "[aeiou]" "hello world" "*") => "h*ll* w*rld"

(irregex-replace/all '(* "foo ") "foo foo platter" "*") => "**p*l*a*t*t*e*r"

(irregex-replace "(.)(.)" "ab" 2 1 "*") => "ba*"

(irregex-replace "...bar" "xxfoobar"
                 (lambda (m)
                   (string-reverse (irregex-match-substring m)))) => "xxraboof"

(irregex-replace "(...)(bar)" "xxfoobar" 2
                 (lambda (m)
                   (string-reverse (irregex-match-substring m 1)))) => "xxbaroof"
</enscript>

==== irregex-split
==== irregex-extract

<procedure>(irregex-split <irx> <str> [<start> <end>])</procedure><br>
<procedure>(irregex-extract <irx> <str> [<start> <end>])</procedure>

{{irregex-split}} splits the string {{<str>}} into substrings divided
by the pattern in {{<irx>}}.
{{irregex-extract}} does the opposite,
returning a list of each instance of the pattern matched disregarding
the substrings in between.

Empty matches will result in subsequent single-character strings in
{{irregex-split}}, or empty strings in {{irregex-extract}}.

<enscript highlight="scheme">
(irregex-split "[aeiou]*" "foobarbaz") => '("f" "b" "r" "b" "z")

(irregex-extract "[aeiou]*" "foobarbaz") => '("" "oo" "" "a" "" "" "a" "")
</enscript>

==== irregex-fold

<procedure>(irregex-fold <irx> <kons> <knil> <str> [<finish> <start> <end>])</procedure>

This performs a fold operation over every non-overlapping place
{{<irx>}} occurs in the string {{<str>}}.

The {{<kons>}} procedure takes the following signature:

<enscript highlight=scheme>
(<kons> <from-index> <match> <seed>)
</enscript>

where {{<from-index>}} is the index from where we started searching
(initially {{<start>}} and thereafter the end index of the last
match), {{<match>}} is the resulting match-data object, and {{<seed>}}
is the accumulated fold result starting with {{<knil>}}.

The rationale for providing the {{<from-index>}} (which is not
provided in the SCSH {{regexp-fold}} utility) is that this
information is useful (e.g. for extracting the unmatched portion of
the string before the current match, as needed in
{{irregex-replace/all}}), and not otherwise directly accessible.

Note when the pattern matches an empty string, to avoid an infinite
loop we continue from one char after the end of the match (as opposed
to the end in the normal case). The {{<from-index>}} passed to
the subsequent {{<kons>}} or {{<finish>}} still refers to
the original previous match end, however, so {{irregex-split}}
and {{irregex-replace/all}}, etc.
do the right thing.

The optional {{<finish>}} takes two arguments:

<enscript highlight=scheme>
(<finish> <from-index> <seed>)
</enscript>

which similarly allows you to pick up the unmatched tail of the string,
and defaults to just returning the {{<seed>}}.

{{<start>}} and {{<end>}} are numeric indices letting you specify the
boundaries of the string on which you want to fold.

To extract all instances of a match out of a string, you can use

<enscript highlight=scheme>
(map irregex-match-substring
     (irregex-fold <irx>
                   (lambda (i m s) (cons m s))
                   '()
                   <str>
                   (lambda (i s) (reverse s))))
</enscript>

Note if an empty match is found {{<kons>}} will be called on that
empty string, and to avoid an infinite loop matching will resume at
the next char. It is up to the programmer to do something sensible
with the skipped char in this case.


=== Extended SRE Syntax

Irregex provides the first native implementation of SREs (Scheme
Regular Expressions), and includes many extensions necessary both for
minimal POSIX compatibility, as well as for modern extensions found in
libraries such as PCRE.

The following table summarizes the SRE syntax, with detailed
explanations following.

    ;; basic patterns
    <string>                          ; literal string
    (seq <sre> ...)                   ; sequence
    (: <sre> ...)
    (or <sre> ...)                    ; alternation

    ;; optional/multiple patterns
    (? <sre> ...)                     ; 0 or 1 matches
    (* <sre> ...)                     ; 0 or more matches
    (+ <sre> ...)                     ; 1 or more matches
    (= <n> <sre> ...)                 ; exactly <n> matches
    (>= <n> <sre> ...)                ; <n> or more matches
    (** <from> <to> <sre> ...)        ; <from> to <to> matches
    (?? <sre> ...)                    ; non-greedy pattern (0 or 1)
    (*? <sre> ...)                    ; non-greedy kleene star
    (**? <from> <to> <sre> ...)       ; non-greedy range

    ;; submatch patterns
    (submatch <sre> ...)              ; numbered submatch
    ($ <sre> ...)
    (submatch-named <name> <sre> ...) ; named submatch
    (=> <name> <sre> ...)
    (backref <n-or-name>)             ; match a previous submatch

    ;; toggling case-sensitivity
    (w/case <sre> ...)                ; enclosed <sre>s are case-sensitive
    (w/nocase <sre> ...)              ; enclosed <sre>s are case-insensitive

    ;; character sets
    <char>                            ; singleton char set
    (<string>)                        ; set of chars
    (or <cset-sre> ...)               ; set union
    (~ <cset-sre> ...)                ; set complement (i.e. [^...])
    (- <cset-sre> ...)                ; set difference
    (& <cset-sre> ...)                ; set intersection
    (/ <range-spec> ...)              ; pairs of chars as ranges

    ;; named character sets
    any
    nonl
    ascii
    lower-case     lower
    upper-case     upper
    alphabetic     alpha
    numeric        num
    alphanumeric   alphanum   alnum
    punctuation    punct
    graphic        graph
    whitespace     white      space
    printing       print
    control        cntrl
    hex-digit      xdigit

    ;; assertions and conditionals
    bos eos                           ; beginning/end of string
    bol eol                           ; beginning/end of line
    bow eow                           ; beginning/end of word
    nwb                               ; non-word-boundary
    (look-ahead <sre> ...)            ; zero-width look-ahead assertion
    (look-behind <sre> ...)           ; zero-width look-behind assertion
    (neg-look-ahead <sre> ...)        ; zero-width negative look-ahead assertion
    (neg-look-behind <sre> ...)       ; zero-width negative look-behind assertion
    (atomic <sre> ...)                ; for (?>...) independent patterns
    (if <test> <pass> [<fail>])       ; conditional patterns
    commit                            ; don't backtrack beyond this (i.e. cut)

    ;; backwards compatibility
    (posix-string <string>)           ; embed a POSIX string literal

==== Basic SRE Patterns

The simplest SRE is a literal string, which matches that string
exactly.

<enscript highlight=scheme>
(irregex-search "needle" "hayneedlehay") => #<match>
</enscript>

By default the match is case-sensitive, though you can control this
either with the compiler flags or local overrides:

<enscript highlight=scheme>
(irregex-search "needle" "haynEEdlehay") => #f

(irregex-search (irregex "needle" 'i) "haynEEdlehay") => #<match>

(irregex-search '(w/nocase "needle") "haynEEdlehay") => #<match>
</enscript>

You can use {{w/case}} to switch back to case-sensitivity inside a
{{w/nocase}} or when the SRE was compiled with {{'i}}:

<enscript highlight=scheme>
(irregex-search '(w/nocase "SMALL" (w/case "BIG")) "smallBIGsmall") => #<match>

(irregex-search '(w/nocase "small" (w/case "big")) "smallBIGsmall") => #f
</enscript>

''Important:'' characters outside the ASCII range (i.e., UTF-8 chars) are
'''not''' matched case insensitively!

Of course, literal strings by themselves aren't very interesting
regular expressions, so we want to be able to compose them. The most
basic way to do this is with the {{seq}} operator (or its abbreviation
{{:}}), which matches one or more patterns consecutively:

<enscript highlight=scheme>
(irregex-search '(: "one" space "two" space "three") "one two three") => #<match>
</enscript>

As you may have noticed above, the {{w/case}} and {{w/nocase}}
operators allowed multiple SREs in a sequence - other operators that
take any number of arguments (e.g.
the repetition operators below)
allow such implicit sequences.

To match any one of a set of patterns use the {{or}} alternation
operator:

<enscript highlight=scheme>
(irregex-search '(or "eeney" "meeney" "miney") "meeney") => #<match>

(irregex-search '(or "eeney" "meeney" "miney") "moe") => #f
</enscript>

==== SRE Repetition Patterns

There are also several ways to control the number of times a pattern
is matched. The simplest of these is {{?}} which just optionally
matches the pattern:

<enscript highlight=scheme>
(irregex-search '(: "match" (? "es") "!") "matches!") => #<match>

(irregex-search '(: "match" (? "es") "!") "match!") => #<match>

(irregex-search '(: "match" (? "es") "!") "matche!") => #f
</enscript>

To optionally match any number of times, use {{*}}, the Kleene star:

<enscript highlight=scheme>
(irregex-search '(: "<" (* (~ #\>)) ">") "<html>") => #<match>

(irregex-search '(: "<" (* (~ #\>)) ">") "<>") => #<match>

(irregex-search '(: "<" (* (~ #\>)) ">") "<html") => #f
</enscript>

Often you want to match any number of times, but at least one time is
required, and for that you use {{+}}:

<enscript highlight=scheme>
(irregex-search '(: "<" (+ (~ #\>)) ">") "<html>") => #<match>

(irregex-search '(: "<" (+ (~ #\>)) ">") "<a>") => #<match>

(irregex-search '(: "<" (+ (~ #\>)) ">") "<>") => #f
</enscript>

More generally, to match at least a given number of times, use {{>=}}:

<enscript highlight=scheme>
(irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<table>") => #<match>

(irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<pre>") => #<match>

(irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<tr>") => #f
</enscript>

To match a specific number of times exactly, use {{=}}:

<enscript highlight=scheme>
(irregex-search '(: "<" (= 4 (~ #\>)) ">") "<html>") => #<match>

(irregex-search '(: "<" (= 4 (~ #\>)) ">") "<table>") => #f
</enscript>

And finally, the most general form is {{**}} which specifies a range
of times to match. All of the earlier forms are special cases of this.

<enscript highlight=scheme>
(irregex-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.168.1.10") => #<match>

(irregex-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.0168.1.10") => #f
</enscript>

There are also so-called "non-greedy" variants of these repetition
operators, by convention suffixed with an additional {{?}}. Since the
normal repetition patterns can match any of the allotted repetition
range, these operators will match a string if and only if the normal
versions matched. However, when the endpoints of which submatch
matched where are taken into account (specifically, all matches when
using {{irregex-search}} since the endpoints of the match itself matter),
the use of a non-greedy repetition can change the result.

So, whereas {{?}} can be thought to mean "match or don't match,"
{{??}} means "don't match or match." {{*}} typically consumes as much
as possible, but {{*?}} tries first to match zero times, and only
consumes one at a time if that fails. If you have a greedy operator
followed by a non-greedy operator in the same pattern, they can
produce surprising results as they compete to make the match longer or
shorter. If this seems confusing, that's because it is. Non-greedy
repetitions are defined only in terms of the specific backtracking
algorithm used to implement them, which for compatibility purposes
always means the Perl algorithm. Thus, when using these patterns you
force IrRegex to use a backtracking engine, and can't rely on
efficient execution.

==== SRE Character Sets

Perhaps more common than matching specific strings is matching any of
a set of characters.
You can use the {{or}} alternation pattern on a
list of single-character strings to simulate a character set, but this
is too clumsy for everyday use so SRE syntax allows a number of
shortcuts.

A single character matches that character literally, a trivial
character class. More conveniently, a list holding a single element
which is a string refers to the character set composed of every
character in the string.

<enscript highlight=scheme>
(irregex-match '(* #\-) "---") => #<match>

(irregex-match '(* #\-) "-_-") => #f

(irregex-match '(* ("aeiou")) "oui") => #<match>

(irregex-match '(* ("aeiou")) "ouais") => #f
</enscript>

Ranges are introduced with the {{/}} operator. Any strings or
characters in the {{/}} are flattened and then taken in pairs to
represent the start and end points, inclusive, of character ranges.

<enscript highlight=scheme>
(irregex-match '(* (/ "AZ09")) "R2D2") => #<match>

(irregex-match '(* (/ "AZ09")) "C-3PO") => #f
</enscript>

In addition, a number of set algebra operations are provided. {{or}},
of course, has the same meaning, but when all the options are
character sets it can be thought of as the set union operator. This
is further extended by the {{&}} set intersection, {{-}} set
difference, and {{~}} set complement operators.

<enscript highlight=scheme>
(irregex-match '(* (& (/ "az") (~ ("aeiou")))) "xyzzy") => #<match>

(irregex-match '(* (& (/ "az") (~ ("aeiou")))) "vowels") => #f

(irregex-match '(* (- (/ "az") ("aeiou"))) "xyzzy") => #<match>

(irregex-match '(* (- (/ "az") ("aeiou"))) "vowels") => #f
</enscript>

==== SRE Assertion Patterns

There are a number of times it can be useful to assert something about
the area around a pattern without explicitly making it part of the
pattern. The most common cases are specifically anchoring some
pattern to the beginning or end of a word or line or even the whole
string.
For example, to match on the end of a word:

<enscript highlight=scheme>
(irregex-search '(: "foo" eow) "foo") => #<match>

(irregex-search '(: "foo" eow) "foo!") => #<match>

(irregex-search '(: "foo" eow) "foof") => #f
</enscript>

The {{bow}}, {{bol}}, {{eol}}, {{bos}} and {{eos}} assertions work similarly.
{{nwb}} asserts that you are not in a word-boundary - if substituted for
{{eow}} in the above examples it would reverse all the results.

There is no {{wb}}, since you tend to know from context whether it
would be the beginning or end of a word, but if you need it you can
always use {{(or bow eow)}}.

Somewhat more generally, Perl introduced positive and negative
look-ahead and look-behind patterns. Perl look-behind patterns are
limited to a fixed length, however the IrRegex versions have no such
limit.

<enscript highlight=scheme>
(irregex-search '(: "regular" (look-ahead " expression"))
                "regular expression")
 => #<match>
</enscript>

The most general case, of course, would be an {{and}} pattern to
complement the {{or}} pattern - all the patterns must match or the
whole pattern fails. This may be provided in a future release,
although it (like look-ahead and look-behind assertions) is unlikely
to be compiled efficiently.

==== SRE Utility Patterns

The following utility regular expressions are also provided for common
patterns that people are eternally reinventing. They are not
necessarily the official patterns matching the RFC definitions of the
given data, because of the way that such patterns tend to be used.
There are three general usages for regexps:

; searching : search for a pattern matching a desired object in a larger text

; validation : determine whether an entire string matches a pattern

; extraction : given a string already known to be valid, extract certain fields from it as submatches

In some cases, but not always, these will overlap.
When they are
different, {{irregex-search}} will naturally always want the searching
version, so IrRegex provides that version.

As an example where these might be different, consider a URL. If you
want to match all the URLs in some arbitrary text, you probably want
to exclude a period or comma at the tail end of a URL, since it's more
likely being used as punctuation rather than part of the URL, despite
the fact that it would be valid URL syntax.

Another problem with the RFC definitions is the standard itself may
have become irrelevant. For example, the pattern IrRegex provides for
email addresses doesn't match quoted local parts (e.g.
{{"first last"@domain.com}}) because these are increasingly rare, and
unsupported by enough software that it's better to discourage their use.
Conversely, technically consecutive periods
(e.g. {{first..last@domain.com}}) are not allowed in email addresses, but
most email software does allow this, and in fact such addresses are
quite common in Japan.

The current patterns provided are:

    newline        ; general newline pattern (crlf, cr, lf)
    integer        ; an integer
    real           ; a real number (including scientific)
    string         ; a "quoted" string
    symbol         ; an R5RS Scheme symbol
    ipv4-address   ; a numeric decimal ipv4 address
    ipv6-address   ; a numeric hexadecimal ipv6 address
    domain         ; a domain name
    email          ; an email address
    http-url       ; a URL beginning with https?://

Because of these issues the exact definitions of these patterns are
subject to change, but will be documented clearly when they are
finalized. More common patterns are also planned, but as what you
want increases in complexity it's probably better to use a real
parser.

=== Supported PCRE Syntax

Since the PCRE syntax is so overwhelmingly complex, it's easier to just
list what we ''don't'' support for now. Refer to the
[[http://pcre.org/pcre.txt|PCRE documentation]] for details.
You
should be using the SRE syntax anyway!

Unicode character classes ({{\P}}) are not supported, but will be
in an upcoming release. {{\C}} named characters are not supported.

Callbacks, subroutine patterns and recursive patterns are not
supported. {{(*FOO)}} patterns are not supported and may never be.

{{\G}} and {{\K}} are not supported.

Octal character escapes are not supported because they are ambiguous
with back-references - just use hex character escapes.

Other than that everything should work, including named submatches,
zero-width assertions, conditional patterns, etc.

In addition, {{\<}} and {{\>}} act as beginning-of-word and end-of-word
marks, respectively, as in Emacs regular expressions.

Also, two escapes are provided to embed SRE patterns inside PCRE
strings, {{"\'<sre>"}} and {{"(*'<sre>)"}}. For example, to match a
comma-delimited list of integers you could use

<enscript highlight=scheme>
"\\'integer(,\\'integer)*"
</enscript>

and to match a URL in angle brackets you could use

<enscript highlight=scheme>
"<(*'http-url)>"
</enscript>

Note in the second example the enclosing {{"(*'...)"}} syntax is needed
because the Scheme reader would consider the closing {{">"}} as part of
the SRE symbol.

The following chart gives a quick reference from PCRE form to the SRE
equivalent:

    ;; basic syntax
    "^"                   ;; bos (or bol inside (?m: ...))
    "$"                   ;; eos (or eol inside (?m: ...))
    "."                   ;; nonl
    "a?"                  ;; (? a)
    "a*"                  ;; (* a)
    "a+"                  ;; (+ a)
    "a??"                 ;; (?? a)
    "a*?"                 ;; (*? a)
    "a+?"                 ;; (+? a)
    "a{n,m}"              ;; (** n m a)

    ;; grouping
    "(...)"               ;; (submatch ...)
    "(?:...)"             ;; (: ...)
    "(?i:...)"            ;; (w/nocase ...)
    "(?-i:...)"           ;; (w/case ...)
    "(?<name>...)"        ;; (=> <name> ...)

    ;; character classes
    "[aeiou]"             ;; ("aeiou")
    "[^aeiou]"            ;; (~ ("aeiou"))
    "[a-z]"               ;; (/ "az") or (/ "a" "z")
    "[[:alpha:]]"         ;; alpha

    ;; assertions
    "(?=...)"             ;; (look-ahead ...)
    "(?!...)"             ;; (neg-look-ahead ...)
    "(?<=...)"            ;; (look-behind ...)
    "(?<!...)"            ;; (neg-look-behind ...)
    "(?(test)pass|fail)"  ;; (if test pass fail)
    "(*COMMIT)"           ;; commit

=== Chunked String Matching

It's often desirable to perform regular expression matching over
sequences of characters not represented as a single string. The most
obvious example is a text-buffer data structure, but you may also want
to match over lists or trees of strings (i.e. ropes), over only
certain ranges within a string, over an input port, etc. With
existing regular expression libraries, the only way to accomplish this
is by converting the abstract sequence into a freshly allocated
string.
This can be expensive, or even impossible if the object is a
text-buffer opened onto a 500MB file.

IrRegex provides a chunked string API specifically for this purpose.
You define a chunking API with {{make-irregex-chunker}}:

==== make-irregex-chunker

<procedure>(make-irregex-chunker <get-next> <get-string> [<get-start> <get-end> <get-substring> <get-subchunk>])</procedure>

where

{{(<get-next> chunk) => }} returns the next chunk, or {{#f}} if there are no more chunks

{{(<get-string> chunk) => }} a string source for the chunk

{{(<get-start> chunk) => }} the start index of the result of {{<get-string>}} (defaults to always 0)

{{(<get-end> chunk) => }} the end (exclusive) of the string (defaults to {{string-length}} of the source string)

{{(<get-substring> cnk1 i cnk2 j) => }} a substring for the range between the chunk {{cnk1}} starting at index {{i}} and ending at {{cnk2}} at index {{j}}

{{(<get-subchunk> cnk1 i cnk2 j) => }} as above but returns a new chunked data type instead of a string (optional)

There are two important constraints on the {{<get-next>}} procedure.
It must return an {{eq?}} identical object when called multiple times
on the same chunk, and it must not return a chunk with an empty string
(start == end). This second constraint is for performance reasons -
we push the work of possibly filtering empty chunks to the chunker
since there are many chunk types for which empty strings aren't
possible, and this work is thus not needed. Note that the initial
chunk passed to match on is allowed to be empty.

{{<get-substring>}} is provided for possible performance improvements
- without it a default is used.
{{<get-subchunk>}} is optional -
without it you may not use {{irregex-match-subchunk}} described above.

You can then match chunks of these types with the following
procedures:

==== irregex-search/chunked
==== irregex-match/chunked

<procedure>(irregex-search/chunked <irx> <chunker> <chunk> [<start>])</procedure><br>
<procedure>(irregex-match/chunked <irx> <chunker> <chunk> [<start>])</procedure>

These return normal match-data objects.

Example:

To match against a simple, flat list of strings use:

<enscript highlight=scheme>
  (define (rope->string rope1 start rope2 end)
    (if (eq? rope1 rope2)
        (substring (car rope1) start end)
        (let loop ((rope (cdr rope1))
                   (res (list (substring (car rope1) start))))
          (if (eq? rope rope2)
              (string-concatenate-reverse      ; from SRFI-13
               (cons (substring (car rope) 0 end) res))
              (loop (cdr rope) (cons (car rope) res))))))

  (define rope-chunker
    (make-irregex-chunker (lambda (x) (and (pair? (cdr x)) (cdr x)))
                          car
                          (lambda (x) 0)
                          (lambda (x) (string-length (car x)))
                          rope->string))

  (irregex-search/chunked <pat> rope-chunker <list-of-strings>)
</enscript>

Here we are just using the default start, end and substring behaviors,
so the above chunker could simply be defined as:

<enscript highlight=scheme>
  (define rope-chunker
    (make-irregex-chunker (lambda (x) (and (pair? (cdr x)) (cdr x))) car))
</enscript>

==== irregex-fold/chunked

<procedure>(irregex-fold/chunked <irx> <kons> <knil> <chunker> <chunk> [<finish> [<start-index>]])</procedure>

Chunked version of {{irregex-fold}}.

=== Utilities

The following procedures are also available.

==== irregex-quote

<procedure>(irregex-quote <str>)</procedure>

Returns a new string with any special regular expression characters
escaped, to match the original string literally in POSIX regular
expressions.

==== irregex-opt

<procedure>(irregex-opt <list-of-strings>)</procedure>

Returns an optimized SRE matching any of the literal strings
in the list, like Emacs' {{regexp-opt}}. Note this optimization
doesn't help when irregex is able to build a DFA.

==== sre->string

<procedure>(sre->string <sre>)</procedure>

Convert an SRE to a PCRE-style regular expression string, if
possible.


---
Previous: [[Module (chicken io)]]

Next: [[Module (chicken keyword)]]